Project 01

Distributional Downscaling

Module 2
Labs
Author

Aaron Xu (ax16)

Published

November 10, 2023

using Dates
using MultivariateStats
using Plots
using NCDatasets
using StatsBase
using Unitful
using DataFrames
using Distributions
using StatsPlots
using Turing
using DynamicHMC

1 Exectuive Summary

In this project, my objective was to model hourly precipitation at a single location given the total precipitation of that day. There are many ways to approach this problem. I opted to do this using distributional methods. Initially, I created a generalize linear model that related the hourly 2-meter temperature to a distribution for hourly precipitation. Then I created a linear regression model that related the hourly 2-meter temperature, the hourly 500hPa temperature, and the 500 hPa relative humidity to the distribution of hourly precipitation. The precipitation could then be estimated via sampling from the distribution. If there are concerns about the correlational support between temperature and precipitation, they are valid, because model performance was laughably bad. Quantile-Quanitle plots between training data and model output showed no clear trends, meaning that my models do serve well as forecasting tools. However, my models did perform better than the null hypothesis model, which argues that precipitation can be estimated without any known environmental conditions.

2 Data Import

3 Data Exploration

3.1 Distribution of Hourly Temperature and Precipitation (all ERA5)

Figure 1: Distributions of hourly temperature and precipitation from the ERA5 dataset. a) The distribution of hourly precipitation. The distribution is dominated by zeros. b) The distribution of hourly temperature 2 meters above ground. The data appears slightly left skewed. c) Hourly precipitation against hourly temperature. There is a very weak negative correlation (corr = -0.017)
Figure 2: Distributions of hourly temperature and precipitation from the ERA5 with only nonzero precipitation obervations. a) The distribution of hourly precipitation. b) The distribution of hourly temperature 2 meters above ground. c) Hourly precipitation against hourly temperature. There is a weak negative correlation (corr = -0.10)
Figure 3: Distributions of hourly 500 hPa relative humidity and temperature from the ERA5 dataset. a) Distribution of relative humidity. b) Distribution of 500 hPa temperature. c) Precipitation vs. 500 hPa relative humidity (cor = 0.26). d) Precipitation vs. 500 hPa relative humidity (cor = 0.029).
Figure 4: Distribution of daily total precipitation from the ERA5 model and CPC obvervations. The correlation of data shown in c) is (cor = 0.025).

3.2 Exploring Distribution Fitting

plot_dist (generic function with 1 method)

The distribution of rainfall is not gaussian in any way. We’ve discussed in class that there are a series of distributions we should try fitting. First try a log-normal distribution, then a log pearson III distribution, and a extreme value distribution for last resort. I first fitted the data to a log-normal distribution and it looked pretty good (@lognormal). Unfortunatley, I couldn’t fit a Pareto or a Weibull distribution without getting errors or bogus distributions, so I will stick with a log-normal distribution.

Figure 5: By eye, A log-normal fit describes hourly precipitation fairly well. The parameters for the log-normal distribution are μ=-3.04 and σ=2.39.

I also checked the distribution of precipitation given certain ranges of temperature. I wanted to see if the distribution parameters changed at all, and they did change throught the temperature brackets (@lognormal-cuts .b).

plot_sliced_precip (generic function with 1 method)
i)   LogNormal{Float64}(μ=-2.5389778370028733, σ=2.6045064318088427)
ii)  LogNormal{Float64}(μ=-3.0652755422578033, σ=2.5621262450568993)
iii) LogNormal{Float64}(μ=-2.925626680304606, σ=2.2865827071057523)
iv)  LogNormal{Float64}(μ=-3.748743357580327, σ=1.8380935211035427)
Figure 6: Histograms and fitted distributions.

4 Methods

I created three models for this project. The first was a null model that assumed an uniform distribution of hourly precipitation, such that for each hour of rainfall, any intensity would have the same probability density. The second and third model both assume that probability density of a given precipitation intensity is described by a Log-Normal distribution with parameters \(\mu\) and \(\sigma\). For the second model, the distribution was fit based on Equation 1 and Equation 2, which were based on GLM methods. For the third model, the distribution was fit based on ?@eq-c and ?@eq-d, which were based on linear regression methods. I acknowledge that it may be a strain to claim that model II is a GLM. The coefficients were determined using MCMC with chain lengths of 100 iterations. The chain length was small to maintain reasonable computing times. Parameter estimation was done on 2019 data and results were tested over 2020 data. Since the model is stochastic in nature, I ran the models 100 times over the testing data and collected the summary statistics of the model results.

4.1 Parameter estimation for model II

In model II, both \(\mu\) and \(\sigma\) were assumed to correlate with 2-meter temperature. The mean is estimated with

\[ \mu_i = \alpha+\beta T_{2,i}, \tag{1}\]

where priors for \(\alpha\) and \(\beta\) were both standard normal distributions, and \(T_i\) is observed 2-meter temperature. The standard deviation is estimated with a homologous equation, that is,

\[ \sigma_i = | \gamma+\chi T_{2,i} |, \tag{2}\]

where priors for \(\gamma\) and \(\chi\) were both standard normal distributions. I have chosen to use the identity link function for equation Equation 1 and a non-cononical absolute value for Equation 2. The idenity link function was chosen because \(\mu \in (-\infty,\infty)\) and, the absolute value function was chosen because \(\sigma \in (0,\infty)\).

4.2 Parameter estimation for model III

In model III, I assumed that the \(\mu\) correlated with temperature and humidity at geopotential of 500 hPa (\(T_{500,i}\), \(H_{i}\) respectively), as well as 2-meter temperature. That is,

\[ \mu_i = \alpha + \beta_1 T_{500,i} + \beta_2 H_{i} + \beta_3 T_{2,i}, \] where all coefficient priors, as well as \(\sigma\), were standard normal distributions.

4.3 Using Observed Daily Total Precipitation as a Boundary Condition

If I sampled from a distribution for \(y_i(x_i |\theta)\) twentyfour times, the sum would not be guarranteed to stay within the observed daily precipitation for that time period. To prevent overestimating daily precipitation, I reject any sample of the distribution of \(y_i(x_i|\theta)\) that put the total daily precipitation above the corresponding observation.

4.4 Testing

5 Results

5.1 MCMC Paramaterization Results

Though sample sizes were small, the parameters appear to be converging and rhat values are around 1.0. I am moderatly confident that the MCMC parameterization behaves well.

(a) Montecarlo Markov-chain results for fitting parameters for model II.
Summary Statistics
  parameters      mean       std      mcse   ess_bulk   ess_tail      rhat   e     Symbol   Float64   Float64   Float64    Float64    Float64   Float64     ⋯
           α    1.4376    0.9580    0.3180     8.3109    31.7840    1.0864     ⋯
           β   -0.0153    0.0032    0.0011     8.2433    38.2860    1.0835     ⋯
           γ   -4.3276    0.8355    0.5904     2.1030    21.0246    1.5158     ⋯
           χ    0.0067    0.0029    0.0020     2.0740    21.0246    1.5245     ⋯
                                                                1 column omitted
(b) ?(caption)
Figure 7: ?(caption)

Montecarlo Markov-chain results for fitting parameters for model III.

Summary Statistics
  parameters      mean       std      mcse   ess_bulk   ess_tail      rhat   e     Symbol   Float64   Float64   Float64    Float64    Float64   Float64     ⋯
           α   -0.9685    0.9921    0.1363    54.9279    31.6824    1.0006     ⋯
          β1   -0.0348    0.0067    0.0010    45.6807    48.7252    1.0230     ⋯
          β2    0.0239    0.0016    0.0002    82.1368    54.8565    1.0213     ⋯
          β3    0.0265    0.0072    0.0011    43.5216    78.3393    0.9900     ⋯
           σ    2.2624    0.0411    0.0062    43.5172    34.2306    1.0202     ⋯
                                                                1 column omitted

5.2 Model Comparison

All three models that I made were terrible at reconstructing real observations. The scatter plots show that predictions and observations are quite inconsistent with each other (Figure 8,Figure 9,Figure 10). Furthermore, the high log-likelihood of low precipitation prediction can be directly attributed to the distribution function used. Ideally, high-likelihood values have further spread into higher precipitation predictions. Perhaps this can be done with bayesian techniques that integrate daily total precipitation into the PDF directly, instead of the sampling trick that I used.

Despite the overwhelming sense of poor performance from the qualitative side, Models II and III do have lower mean squarred errors when compared to the null model (Figure 8,Figure 9,Figure 10). This suggests that, even though predictions are pretty much pure guess work, using prior knowledge about the climatology does produce better guess. However, there does not seem to be much difference due to the implementation of the model and the information provided to the models.

I have also collected other summary statistics that provide insight onto model performance such as the total modeled precipitation and the model-observation correlation. The total modeled precipiation is best reconstructed by Model I, which is not surprising since the sampling distribution favors no precipitation intensity. The total modeled precipitation is significantly less the the actual annual total precipitation for models II and III. This is also expected as the log-normal distributions each point is sampled from favors low intensity precipitation values much more than high intensity precipitation.

(a) Summary results for the null hypothesis (Model I). a) Q-Q plot. b) Ensemble statistics. Color bar shows the log-likelihood for each predicted point.
5×7 DataFrame
Row variable mean min median max nmissing eltype
Symbol Number Number Number Number Int64 DataType
1 total_precip 1147.01 mm 1144.43 mm 1144.43 mm 1404.99 mm 0 Quantity{Float64, 𝐋, FreeUnits{(mm,), 𝐋, nothing}}
2 total_modeled_precip 1146.34 mm 1143.49 mm 1143.77 mm 1404.27 mm 0 Quantity{Float64, 𝐋, FreeUnits{(mm,), 𝐋, nothing}}
3 model_likelihood 707.32 365.861 712.44 1011.63 0 Float64
4 model_obs_correlation 0.0960184 0.0668904 0.0946014 0.177868 0 Float64
5 mse 1.45541 mm² 1.18064 mm² 1.44745 mm² 2.07862 mm² 0 Quantity{Float64, 𝐋², FreeUnits{(mm²,), 𝐋², nothing}}
(b) ?(caption)
Figure 8: ?(caption)
(a) Summary results for the GLM method (Model II). a) Q-Q plot. b) Ensemble statistics. Color bar shows the log-likelihood for each predicted point.
5×7 DataFrame
Row variable mean min median max nmissing eltype
Symbol Number Number Number Number Int64 DataType
1 total_precip 1147.01 mm 1144.43 mm 1144.43 mm 1404.99 mm 0 Quantity{Float64, 𝐋, FreeUnits{(mm,), 𝐋, nothing}}
2 total_modeled_precip 678.167 mm 605.361 mm 677.79 mm 750.639 mm 0 Quantity{Float64, 𝐋, FreeUnits{(mm,), 𝐋, nothing}}
3 model_likelihood 9329.05 8869.35 9370.57 9937.11 0 Float64
4 model_obs_correlation 0.124325 0.068071 0.12131 0.258016 0 Float64
5 mse 0.53666 mm² 0.421163 mm² 0.528945 mm² 0.703593 mm² 0 Quantity{Float64, 𝐋², FreeUnits{(mm²,), 𝐋², nothing}}
(b) ?(caption)
Figure 9: ?(caption)
(a) Summary results for the linear regression method (Model III). a) Q-Q plot. b) Ensemble statistics. Color bar shows the log-likelihood for each predicted point.
5×7 DataFrame
Row variable mean min median max nmissing eltype
Symbol Number Number Number Number Int64 DataType
1 total_precip 1147.01 mm 1144.43 mm 1144.43 mm 1404.99 mm 0 Quantity{Float64, 𝐋, FreeUnits{(mm,), 𝐋, nothing}}
2 total_modeled_precip 747.919 mm 651.62 mm 749.794 mm 880.444 mm 0 Quantity{Float64, 𝐋, FreeUnits{(mm,), 𝐋, nothing}}
3 model_likelihood 9174.37 8680.74 9182.31 9683.36 0 Float64
4 model_obs_correlation 0.180826 0.110423 0.173946 0.375689 0 Float64
5 mse 0.549778 mm² 0.434284 mm² 0.539754 mm² 0.785719 mm² 0 Quantity{Float64, 𝐋², FreeUnits{(mm²,), 𝐋², nothing}}
(b) ?(caption)
Figure 10: ?(caption)

6 Conclusion

In this project I have explored the use of distributional downscaling for precipitation time series. I have also naively implemented an algorithm to digest available observations in an attempt to ground my distributional model. This method has several limitations, some are characteristic of the method, while others are specific to my implementations. In general, distributional downscaling is not suitable for making climate forcasts. Instead, it is good for providing a statistical assessment for a given set of predictors. Unique to my model, is the tendency to underpredict precipitation, which seems similar to dreary bias that we’ve learned in class. In order to improve upon my work, I would need to find a better method for assimilating existing observations during the downscaling process, perhaps with bayesian methods that update probability of certain rainfall using observations and continuity rules.